Abstract —The estimation of the time- and frequency-dependent
coherent-to-diffuse power ratio (CDR) from the measured spatial
coherence between two omnidirectional microphones is investigated.
Known CDR estimators are formulated in a common framework,
illustrated using a geometric interpretation in the complex plane,
and investigated with respect to bias and robustness towards model
errors. Several novel unbiased CDR estimators are proposed, and it
is shown that knowledge of either the direction of arrival (DOA) of
the target source or the coherence of the noise field is sufficient
for unbiased CDR estimation. The validity of the model for the
application of CDR estimates to dereverberation is investigated
using measured and simulated impulse responses. A CDR-based
dereverberation system is presented and evaluated using signal-based
quality measures as well as automatic speech recognition accuracy.
The results show that the proposed unbiased estimators have a
practical advantage over existing estimators, and that the proposed
DOA-independent estimator can be used for effective blind
dereverberation.

Index Terms —Spatial Coherence, Diffuse Noise Suppression, 
Diffuseness, Dereverberation, Reverberation Suppression 

I. Introduction

It has been observed as early as 1969 that the measured
spatial coherence between two microphones allows the
discrimination between direct sound and reverberation [1]. A
first signal enhancement algorithm based on this observation
was proposed by Allen et al. in 1977 [2], where the magnitude
of the coherence is estimated in the Short-Time Fourier
Transform (STFT) domain and used as a gain for reverberation
suppression. Other heuristic methods for noise reduction and
dereverberation using coherence estimates have since been
proposed [3]-[7]. Related methods have also been investigated
for noise suppression in connection with beamforming, and
postfilters which are statistically optimal under certain
conditions have been proposed for the suppression of uncorrelated
[8] and diffuse [9] noise.

More recently, explicit estimators for the ratio between
direct and diffuse signal components, termed the
coherent-to-diffuse power ratio (CDR), from short-time coherence
estimates have been formulated [10], [11], based on the same
assumptions as the earlier optimum postfilter derivations [9].
Also, results have since been generalized from omnidirectional
microphones to other microphone directivities [12], [13] and
spherical microphone arrays [14]. While these estimates can be
used for the formulation of postfilters for signal enhancement
[15], which is the main application considered in this
contribution, short-time CDR estimates (or the equivalent "diffuseness"
measure) also have applications in parametric coding of spatial
audio signals [16] and the extraction of spatial features for
automatic speech recognition (ASR) [17].

In this contribution, the estimation of the CDR from
the measured coherence between two omnidirectional
microphones, and the application of the CDR estimates to
dereverberation, is investigated. First, the signal model for the
recording of a noisy or reverberant signal with two omnidirectional
microphones is described, the relationship between signal and
noise coherence models and the coherence of the mixed signal
is given, and coherence models for the application to
dereverberation are discussed. Then, several known CDR estimators
are formulated in a common framework, illustrated using a
geometric interpretation in the complex plane, and improved
unbiased estimators are proposed. It is shown that knowledge
of either the target signal direction or the noise coherence
is sufficient for an unbiased CDR estimation, and estimators
are proposed for the cases of unknown target signal direction
and unknown noise coherence. Finally, the CDR estimators
are applied in a postfilter for reverberation suppression and
evaluated by processing reverberant speech and comparing
ASR accuracy as well as various signal quality
measures. This paper builds on results published in a recent
conference paper by the same authors, in which the novel
estimators were initially proposed [15].

II. Signal Model 

We consider the recording of a reverberant or noisy speech
signal by two omnidirectional microphones with a spacing $d$,
located in the same horizontal plane. The signal $x_i(t)$ of the
$i$-th microphone is composed of a desired signal component
$s_i(t)$ and an undesired component $n_i(t)$ consisting of noise
and/or late reverberation, i.e.,

$$x_i(t) = s_i(t) + n_i(t), \quad i = 1, 2. \tag{1}$$

The microphone, desired and noise signals are represented
in the time-frequency (STFT) domain by the corresponding
uppercase letters, i.e., $X_i(l,f)$, $S_i(l,f)$ and $N_i(l,f)$,
respectively, with the discrete-time frame index $l$ and continuous
frequency $f$, and are assumed to be short-time stationary.
Using the representation in the STFT domain, the short-time
auto- and cross-power spectra between two signals $u(t)$ and
$v(t)$ are defined as

$$\Phi_{uv}(l,f) = \mathcal{E}\{U(l,f)\,V^*(l,f)\}, \tag{2}$$
where $\mathcal{E}$ is the expectation operator. It is assumed that the
auto-power spectra of the signal components are the same at
both microphones, i.e.,

$$\Phi_{s_1 s_1}(l,f) = \Phi_{s_2 s_2}(l,f) = \Phi_s(l,f), \tag{3}$$

$$\Phi_{n_1 n_1}(l,f) = \Phi_{n_2 n_2}(l,f) = \Phi_n(l,f). \tag{4}$$

Note that this assumption is generally appropriate for a plane
wave as desired signal as well as for noise and late
reverberation, but may in practice be impacted by the presence of
early reflections causing destructive or constructive
interference. The time- and frequency-dependent signal-to-noise ratio
(SNR) of the microphone signals can be defined as

$$\mathrm{SNR}(l,f) = \frac{\Phi_s(l,f)}{\Phi_n(l,f)}. \tag{5}$$

The complex spatial coherence functions of the desired signal
and noise components are given by

$$\Gamma_s(f) = \frac{\Phi_{s_1 s_2}(l,f)}{\Phi_s(l,f)}, \qquad \Gamma_n(f) = \frac{\Phi_{n_1 n_2}(l,f)}{\Phi_n(l,f)}, \tag{6}$$

respectively, and are assumed to be time-invariant, i.e.,
dependent only on the spatial characteristics of the signal
components. It is furthermore assumed that signal and noise are
mutually orthogonal, such that

$$\Phi_{x_i x_j}(l,f) = \Phi_{s_i s_j}(l,f) + \Phi_{n_i n_j}(l,f), \quad i,j \in \{1,2\}. \tag{7}$$

The complex spatial coherence of the mixed sound field can
then be written as a function of the SNR and the signal and
noise coherence functions:

$$\Gamma_x(l,f) = \frac{\mathrm{SNR}(l,f)\,\Gamma_s(f) + \Gamma_n(f)}{\mathrm{SNR}(l,f) + 1}. \tag{8}$$

This relationship is valid for any signal and noise coherence
function. For the special case of a fully coherent desired
signal component and diffuse noise, the term CDR or
direct-to-diffuse ratio (DDR) is often used for the SNR. We will
adopt the term CDR in the following. Equation (8) can be rewritten
as a parametric line equation in the complex plane, highlighting
that $\Gamma_x$ lies on a straight line connecting $\Gamma_n$ and $\Gamma_s$:

$$\Gamma_x(l,f) = \Gamma_s(f) + \frac{1}{\mathrm{CDR}(l,f) + 1}\,\bigl(\Gamma_n(f) - \Gamma_s(f)\bigr). \tag{9}$$

Note that the line parameter $D(l,f) = [\mathrm{CDR}(l,f) + 1]^{-1}$ is
equivalent to the diffuseness defined in [18].


III. Coherence Models for Dereverberation

The desired and noise or reverberation components of
the microphone signals are characterized by time-invariant
coherence functions $\Gamma_s(f)$ and $\Gamma_n(f)$, respectively. In the
following, suitable models for these spatial coherence functions
are discussed for the application to dereverberation.

A. Desired Signal

The desired signal component is modeled as a plane wave
with the direction of arrival (DOA) $\theta$ with respect to the
microphone axis, where $\theta = 0°$ corresponds to broadside direction.
The corresponding time-invariant coherence function is given
by

$$\Gamma_s(f) = e^{j k d \sin(\theta)} = e^{j 2\pi f \Delta t}, \tag{10}$$

with the time difference of arrival (TDOA) $\Delta t = d\sin(\theta)/c$,
the wavenumber $k = 2\pi f/c$ and the speed of sound $c$. This
coherence function always has a magnitude of one, and is
equal to one for $\Delta t = 0$.

B. Reverberation as Isotropic Sound Field

In array signal processing, environmental noise is often
modeled by the superposition of an infinite number of
uncorrelated, spatially distributed noise sources. In applications
like underwater acoustics or radio communication, this model
is motivated by the presence of many independent noise and
interfering sources around the receiver [19]. The most common
assumption for the spatial distribution is a sphere centered
around the receiver, which corresponds to what is known
as a diffuse or spherically isotropic noise field. The spatial
coherence function between two omnidirectional sensors in a
diffuse noise field is real-valued and given by

$$\Gamma_{\mathrm{diffuse}}(f) = \frac{\sin(kd)}{kd} = \frac{\sin(2\pi f d/c)}{2\pi f d/c}. \tag{11}$$


While diffusivity of the noise field is easily motivated in
the aforementioned scenarios, a few more considerations are
necessary for the modeling of a reverberation component
originating from a single excitation signal. Since acoustic
transmission within a room is generally assumed to be
linear and time-invariant, a reverberant signal can be modeled
by the convolution of a source signal with a time-invariant
room impulse response (RIR) [20]. The reverberant signals
recorded at two points in space, i.e., by two microphones,
are therefore linearly related, and the theoretical coherence
function between these two signals is equal to one. However,
when limited observation windows are considered, and the
excitation signal has a limited temporal correlation, reflections
with different delays can be approximated as uncorrelated
sources. This uncorrelated scattering assumption is widely
used in mobile radio communications [21] and underwater
acoustics [22], and is useful in room acoustics as well, where
it has been observed that the sound field in a reverberant room
appears as an approximately diffuse sound field [23], [24]. The
plausibility of the diffuseness assumption for reverberation
can be visualized using the image source model [25]: for
higher reflection orders, the angular distribution of the image
sources becomes increasingly isotropic. Furthermore, given
a limited observation window length, the delayed reflected
versions of the source signal are increasingly decorrelated
with increasing reflection orders. Based on this idea, we can
predict a number of factors which contribute to how well
the model of diffuseness is fulfilled: a large room contributes
to the uncorrelatedness of the image sources, due to larger
relative delays between reflections; highly reflective surfaces
contribute to the presence of many image sources with similar
power, since the power contributed by reflections decays more
slowly with the reflection order; and low temporal correlation
of the source signal contributes to low correlation between
the delayed reflections. Some of these effects are illustrated in
Section VI-B using measured and simulated RIRs.

In real rooms, effects like diffraction, diffuse reflection
[20], and potentially time-variant effects [26] may further
contribute to the randomization of delays and incidence angles
of reflections and therefore increase the diffuseness of the
reverberation sound field. However, as shown later, the image
source model is sufficient to explain a wide range of practical
effects which affect the reverberation coherence.

While the diffuse sound field model is the most common in
room acoustics and signal enhancement, it has been observed
that reverberant noise in rooms with highly absorbing floors
and ceilings can be modeled more accurately by noise sources
distributed in the horizontal plane, i.e., by a 2D isotropic
(cylindrically isotropic) noise field, as opposed to a diffuse
(spherically isotropic) noise field [27]. This noise field model
consists of uncorrelated noise sources located on a circle
around and in the same plane as the microphones (typically
the horizontal plane), and is motivated by the rapid decay of
all vertically propagating sound components due to the strong
absorption at the floor and/or ceiling. The corresponding
spatial coherence function for two omnidirectional microphones
located in the same plane as the noise sources is the
zeroth-order Bessel function of the first kind [23], [28]:

$$\Gamma_{\text{2D-iso}}(f) = J_0(kd) = J_0(2\pi f d/c). \tag{12}$$

Note that, both in the case of diffuse and 2D-isotropic noise 
fields, the coherence function is real-valued, since the spatial 
distribution of the sources is symmetric with respect to the 
microphone array axis. 
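
The coherence models (10), (11) and (12) can be evaluated numerically with a few lines of code. The following is a minimal sketch in plain Python, not the authors' MATLAB implementation; the function names, the speed of sound of 343 m/s, and the truncated power series used for $J_0$ are assumptions made for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not specified in the text


def coherence_plane_wave(f, tdoa):
    """Eq. (10): coherence of a plane wave with the given TDOA in seconds."""
    return cmath.exp(1j * 2.0 * math.pi * f * tdoa)


def coherence_diffuse(f, d, c=SPEED_OF_SOUND):
    """Eq. (11): sin(kd)/(kd) for a spherically isotropic (diffuse) field."""
    kd = 2.0 * math.pi * f * d / c
    return 1.0 if kd == 0.0 else math.sin(kd) / kd


def bessel_j0(x, terms=40):
    """Zeroth-order Bessel function of the first kind via its power series.

    Adequate for the moderate arguments kd that occur here; a library
    routine would be preferable for large arguments.
    """
    total, term = 0.0, 1.0
    for m in range(terms):
        total += term
        # Ratio of consecutive series terms: -(x/2)^2 / (m+1)^2.
        term *= -((x / 2.0) ** 2) / ((m + 1) ** 2)
    return total


def coherence_2d_isotropic(f, d, c=SPEED_OF_SOUND):
    """Eq. (12): J0(kd) for a cylindrically isotropic field."""
    return bessel_j0(2.0 * math.pi * f * d / c)
```

For the spacing d = 8 cm used later in the evaluation and f = 1 kHz, `coherence_diffuse(1000.0, 0.08)` is roughly 0.68, and both isotropic models return real values, in line with the remark above.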

In Section VI-B, the effects of room geometry and surface 
reflectivity on the coherence of the reverberation component 
are evaluated using RIRs generated with the image source 
method, and RIRs that were measured in different rooms. 

IV. Coherent-to-Diffuse Power Ratio Estimation 

For most proposed postfilters, the gain function has been
formulated directly as a function of auto- and cross-power
spectral estimates [8], [9], which are typically obtained from
the microphone signals by recursive averaging:

$$\hat{\Phi}_{x_i x_j}(l,f) = \lambda\,\hat{\Phi}_{x_i x_j}(l-1,f) + (1-\lambda)\,X_i(l,f)\,X_j^*(l,f), \tag{13}$$

where $\lambda$ is a constant between 0 and 1. We follow a different
approach where we first derive an SNR estimate, which can
then be used to apply any suppression technique such as the
Wiener filter or spectral subtraction [29]. Furthermore, we
write the estimate not as a function of auto- and cross-power
spectral estimates, but as a function of the estimated
short-time spatial coherence, which allows additional insight into
the behavior of the estimator. The short-time coherence is
estimated by

$$\hat{\Gamma}_x(l,f) = \frac{\hat{\Phi}_{x_1 x_2}(l,f)}{\sqrt{\hat{\Phi}_{x_1 x_1}(l,f)\,\hat{\Phi}_{x_2 x_2}(l,f)}}. \tag{14}$$
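
The recursive averaging of the auto- and cross-power spectra and the normalization to a coherence estimate can be sketched for a single frequency bin as follows. This is an illustrative Python sketch only: the function name, the forgetting factor of 0.7, the zero initialization of the spectra and the handling of silent frames are assumptions, not values taken from the text.

```python
import math


def estimate_coherence(frames_x1, frames_x2, lam=0.7):
    """Track Eq. (13) over frames for one frequency bin and return Eq. (14).

    frames_x1, frames_x2: sequences of complex STFT coefficients X_1(l, f)
    and X_2(l, f) for a fixed f. lam is the forgetting factor (assumed
    value); the spectra start from zero, which is also an assumption.
    """
    phi_11 = phi_22 = 0.0
    phi_12 = 0j
    coherence = []
    for x1, x2 in zip(frames_x1, frames_x2):
        phi_11 = lam * phi_11 + (1.0 - lam) * abs(x1) ** 2
        phi_22 = lam * phi_22 + (1.0 - lam) * abs(x2) ** 2
        phi_12 = lam * phi_12 + (1.0 - lam) * x1 * complex(x2).conjugate()
        denom = math.sqrt(phi_11 * phi_22)
        # Guard against silent frames, where the coherence is undefined.
        coherence.append(phi_12 / denom if denom > 0.0 else 0j)
    return coherence
```

If the second channel is a pure phase-shifted copy of the first, the returned values lie on the unit circle; for unrelated channels the magnitude stays below one, and it shrinks as more frames are averaged.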

Since the focus is on estimating the SNR for a mixture of
a fully coherent signal with $|\Gamma_s(f)| = 1$ and isotropic noise
with $\Gamma_n \in \mathbb{R}$, where typically $\Gamma_n(f) = \Gamma_{\mathrm{diffuse}}(f)$, we use the
term CDR instead of SNR for the quantity to be estimated in
the following. For the application to dereverberation, the CDR
is equivalent to the direct-to-reverberation power ratio (DRR),
under the assumption that reverberant sound can be modeled
as a mixture of a direct component and a perfectly diffuse
reverberation component which are mutually uncorrelated,
thus neglecting early reflections.

The aim is now to estimate the CDR from an estimate of
the short-time spatial coherence $\hat{\Gamma}_x(l,f)$, exploiting the known
coherence functions of the signal and/or noise component,
and the relationship of these coherence models and the mixed
sound field coherence to the CDR given by (9). Solving (9)
for the CDR yields (for brevity, the time- and
frequency-dependency is omitted in the following)

$$\mathrm{CDR} = \frac{\Gamma_n - \Gamma_x}{\Gamma_x - \Gamma_s}, \tag{15}$$

or, reformulated as the diffuseness $D$,

$$D = \frac{1}{\mathrm{CDR} + 1} = \frac{\Gamma_x - \Gamma_s}{\Gamma_n - \Gamma_s}. \tag{16}$$


Although $\Gamma_x$ and $\Gamma_s$ may be complex, the CDR and
diffuseness are real-valued quantities; however, when inserting
a coherence estimate $\hat{\Gamma}_x$ for $\Gamma_x$ in (15), the resulting values
are in general complex-valued, due to mismatch between the
coherence models and the actual acoustic conditions, and the
variance of the coherence estimate. Estimating the CDR by
direct application of (15) is therefore not feasible, which is
why a number of different estimator implementations, which
yield a positive, real-valued CDR estimate for all possible
values of $\hat{\Gamma}_x$, $|\hat{\Gamma}_x| < 1$, have been proposed.
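
A small numerical experiment makes this concrete. The sketch below (illustrative Python; the parameter values d = 8 cm, f = 1 kHz, c = 343 m/s, a TDOA of d/(2c) and the 0.1 rad phase error are assumptions) builds a mixture coherence from the model, inverts it with (15), and then repeats the inversion after a small phase perturbation of the coherence.

```python
import cmath
import math


def mixture_coherence(cdr, gamma_s, gamma_n):
    """Coherence of a coherent signal plus noise at a given CDR (model of Sec. II)."""
    return (cdr * gamma_s + gamma_n) / (cdr + 1.0)


def cdr_direct(gamma_x, gamma_s, gamma_n):
    """Eq. (15) applied as is; the result is complex in general."""
    return (gamma_n - gamma_x) / (gamma_x - gamma_s)


# Illustrative values (assumed): d = 8 cm, f = 1 kHz, c = 343 m/s, TDOA d/(2c).
kd = 2.0 * math.pi * 1000.0 * 0.08 / 343.0
gamma_n = math.sin(kd) / kd          # diffuse noise coherence, real-valued
gamma_s = cmath.exp(1j * kd / 2.0)   # plane wave coherence, on the unit circle

on_line = mixture_coherence(1.0, gamma_s, gamma_n)  # true CDR = 1 (0 dB)
off_line = on_line * cmath.exp(0.1j)                # 0.1 rad phase error

exact = cdr_direct(on_line, gamma_s, gamma_n)       # real-valued, equals 1
perturbed = cdr_direct(off_line, gamma_s, gamma_n)  # clearly complex-valued
```

On the model line the inversion returns the true CDR with a vanishing imaginary part, whereas the perturbed coherence produces a value with a substantial imaginary part, which is why the estimators discussed next map every coherence value to a real, non-negative number.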

In the following, first, the interpretation of the estimator
behavior in the complex plane is discussed. Then, existing
and novel approaches to CDR estimation are analyzed. For
an easier comparison, the estimators are reformulated as a
function of only the coherence estimate $\hat{\Gamma}_x$ and the assumed
coherence models $\hat{\Gamma}_s$ and $\hat{\Gamma}_n$, where $\hat{\Gamma}_s$ is the direct signal
coherence computed according to (10) from an a-priori known
or estimated TDOA $\widehat{\Delta t}$, and $\hat{\Gamma}_n$ is assumed to match the
diffuse coherence model (11). We start with methods which
make use of both $\hat{\Gamma}_s$ and $\hat{\Gamma}_n$, i.e., exploit information on
the DOA and the noise coherence, continue with
DOA-independent estimators which exploit only the knowledge of
$\hat{\Gamma}_n$, and finally propose a CDR estimator for the case of
available signal coherence $\hat{\Gamma}_s$, but unknown noise coherence.
Table I summarizes the presented estimators and their main
properties. Finally, estimator bias and robustness are evaluated.
A. Interpretation of Estimator Behavior in the Complex Plane

Fig. 1 shows the output of the estimators which are
described in the following sections in the complex plane of
possible coherence values $\hat{\Gamma}_x$. Results for a direct signal
TDOA $\Delta t = 0$ (broadside) are shown in the first row, while
in the second row, results are shown for $\Delta t = d/(2c)$. For all
estimators, $\hat{\Gamma}_s = \Gamma_s$, $\hat{\Gamma}_n = \Gamma_n$ is assumed. The symbol
o marks the coherence of a fully coherent signal with the
respective TDOA according to (10), while the symbol x marks
the coherence of an ideal diffuse signal given by (11). The
straight white line between these points marks the theoretical
coherence values which would occur under ideal conditions
for different CDR values, according to (9). The bias of a
CDR estimator is henceforth defined as the deviation of the
estimator from (15) for coherence values along this line; i.e.,
an unbiased estimator should exactly match (15) for these
values. This can be verified by inserting $\Gamma_x$ according to (9) for
$\hat{\Gamma}_x$ into the estimator equation, which yields $\widehat{\mathrm{CDR}} = \mathrm{CDR}$
for an unbiased estimator. Furthermore, since the coherence
estimates $\hat{\Gamma}_x$, which are observed in practice, will not lie
exactly on the line, a good estimator should also be robust
in the sense that some deviations of the coherence estimate
from the assumed model, e.g., caused by an imperfect DOA
estimate, do not lead to large deviations of the CDR estimate.
In Fig. 1, robustness can be seen in the change of the CDR
estimate for coherence values slightly deviating from the line;
if these changes are abrupt, as in Fig. 1b for coherence values
close to the unit circle, this indicates non-robust behavior.
While we do not derive a measure for the overall robustness
of an estimator, which would require establishing a statistical
model for the errors, we evaluate the behavior of the different
estimators with coherence model errors in Section IV-E.


Table I. Overview of investigated CDR estimators, required prior
information (noise and/or signal coherence) and unbiasedness.

| Estimator | Definition | Required | Unbiased |
| Jeub | $\dfrac{\hat{\Gamma}_n - \mathrm{Re}\{\hat{\Gamma}_s^* \hat{\Gamma}_x\}}{\mathrm{Re}\{\hat{\Gamma}_s^* \hat{\Gamma}_x\} - 1}$ | $\hat{\Gamma}_n$, $\hat{\Gamma}_s$ | no |
| Thiergart 1 | $\mathrm{Re}\left\{\dfrac{\hat{\Gamma}_n - \hat{\Gamma}_x}{\hat{\Gamma}_x - \hat{\Gamma}_s}\right\}$ | $\hat{\Gamma}_n$, $\hat{\Gamma}_s$ | yes |
| Proposed 1 | $\dfrac{\mathrm{Re}\{\hat{\Gamma}_s^* (\hat{\Gamma}_n - \hat{\Gamma}_x)\}}{\mathrm{Re}\{\hat{\Gamma}_s^* \hat{\Gamma}_x\} - 1}$ | $\hat{\Gamma}_n$, $\hat{\Gamma}_s$ | yes |
| Proposed 2 | $\dfrac{1 - \hat{\Gamma}_n \cos(\arg \hat{\Gamma}_s)}{|\hat{\Gamma}_n - \hat{\Gamma}_s|} \left|\dfrac{\hat{\Gamma}_s^* (\hat{\Gamma}_n - \hat{\Gamma}_x)}{\mathrm{Re}\{\hat{\Gamma}_s^* \hat{\Gamma}_x\} - 1}\right|$ | $\hat{\Gamma}_n$, $\hat{\Gamma}_s$ | yes |
| Thiergart 2 | $\mathrm{Re}\left\{\dfrac{\hat{\Gamma}_n - \hat{\Gamma}_x}{\hat{\Gamma}_x - e^{j \arg \hat{\Gamma}_x}}\right\}$ | $\hat{\Gamma}_n$ | no |
| Proposed 3 | (25) | $\hat{\Gamma}_n$ | yes |
| Proposed 4 | (27) | $\hat{\Gamma}_s$ | yes |

Figure 1. Coherent-to-diffuse power ratio estimates obtained from different estimators (columns) as a function of the complex spatial coherence estimate
$\hat{\Gamma}_x$. The theoretical coherence of fully coherent ($\Gamma_s$) and fully diffuse ($\Gamma_n$) signals is marked by o and x, respectively, while the theoretical coherence of
mixed signals lies on the connecting line. Estimators are computed using $\hat{\Gamma}_s = \Gamma_s$, $\hat{\Gamma}_n = \Gamma_n$. Parameters $d = 8$ cm, $f = 1$ kHz, different TDOAs (rows).
Panels: a) Jeub; b) Thiergart 1 (unbiased); c) proposed 1 (unbiased); d) proposed 2 (unbiased); e) Thiergart 2 (DOA-indep.); f) proposed 3 (unbiased, DOA-indep.); g) proposed 4 (unbiased, noise-indep.).


B. CDR Estimation for Known DOA and Noise Coherence

Using the same model as described in Section II, McCowan
and Bourlard [9] derived the Wiener postfilter for a coherent
signal in diffuse noise. Jeub et al. [30] evaluated this postfilter
for the suppression of reverberation, and formulated a CDR
estimate based on the same model [10]. Both McCowan
and Jeub rely on the assumption that the direct signal is
time-aligned in both microphones, which can be achieved
by applying a delay corresponding to the TDOA estimate
$\widehat{\Delta t}$ to one of the channels [30]. In the STFT domain, this
delay is equivalent to a phase rotation of the cross-power
spectrum (assuming that the delay is significantly shorter than
the transform length), and can therefore be represented in the
CDR estimator equation by multiplying the coherence estimate
$\hat{\Gamma}_x$ with the complex rotation factor
$e^{-j 2\pi f \widehat{\Delta t}} = \hat{\Gamma}_s^*$. This
allows the formulation of the CDR estimator including time
alignment as a function of only $\hat{\Gamma}_x$, $\hat{\Gamma}_s$ and $\hat{\Gamma}_n$:

$$\widehat{\mathrm{CDR}}_{\mathrm{Jeub}}(l,f) = \max\left(0, \frac{\hat{\Gamma}_n - \mathrm{Re}\{e^{-j 2\pi f \widehat{\Delta t}}\,\hat{\Gamma}_x\}}{\mathrm{Re}\{e^{-j 2\pi f \widehat{\Delta t}}\,\hat{\Gamma}_x\} - 1}\right) = \max\left(0, \frac{\hat{\Gamma}_n - \mathrm{Re}\{\hat{\Gamma}_s^* \hat{\Gamma}_x\}}{\mathrm{Re}\{\hat{\Gamma}_s^* \hat{\Gamma}_x\} - 1}\right). \tag{17}$$

The maximum operation is required to prevent negative results
for the CDR estimate. This estimator is unbiased for $\hat{\Gamma}_s = 1$,
i.e., $\Delta t = 0$. However, for non-zero TDOAs, the phase rotation
of the coherence estimate $\hat{\Gamma}_x$ does not only affect the direct
signal component, but also the coherence of the diffuse signal
component. Since this is not accounted for by this estimator,
the estimate is biased for non-zero TDOAs. The estimator is
illustrated in Fig. 1a.
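
The bias of this estimator for non-zero TDOAs can be reproduced numerically. The following Python sketch is illustrative only: it assumes that the estimator has the form max(0, (Γ̂_n - Re{Γ̂_s* Γ̂_x}) / (Re{Γ̂_s* Γ̂_x} - 1)) as listed in Table I, and the parameter values (d = 8 cm, f = 1 kHz, c = 343 m/s, TDOA d/(2c)) as well as the guard for a fully coherent input are assumptions.

```python
import cmath
import math


def mixture_coherence(cdr, gamma_s, gamma_n):
    """Model coherence of a coherent signal in noise at a given CDR."""
    return (cdr * gamma_s + gamma_n) / (cdr + 1.0)


def cdr_jeub(gamma_x, gamma_s, gamma_n):
    """Time-align the coherence estimate, then invert the broadside model."""
    aligned = (gamma_s.conjugate() * gamma_x).real
    if aligned >= 1.0:
        return math.inf  # fully coherent input (guard added for this sketch)
    return max(0.0, (gamma_n - aligned) / (aligned - 1.0))


kd = 2.0 * math.pi * 1000.0 * 0.08 / 343.0
gamma_n = math.sin(kd) / kd
broadside = complex(1.0, 0.0)            # TDOA of zero
off_axis = cmath.exp(1j * kd / 2.0)      # TDOA of d/(2c)

true_cdr = 1.0
est_broadside = cdr_jeub(
    mixture_coherence(true_cdr, broadside, gamma_n), broadside, gamma_n)
est_off_axis = cdr_jeub(
    mixture_coherence(true_cdr, off_axis, gamma_n), off_axis, gamma_n)
```

With these assumed parameters, `est_broadside` reproduces the true CDR, while `est_off_axis` falls noticeably below it, because the rotated noise coherence is no longer described by the real-valued model.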

Thiergart et al. [11], [13] proposed to estimate the CDR by
directly inserting the target signal coherence estimate $\hat{\Gamma}_s$
(together with $\hat{\Gamma}_x$ and $\hat{\Gamma}_n$) into (15), and taking the real part:

$$\widehat{\mathrm{CDR}}_{\mathrm{Thiergart1}}(l,f) = \max\left(0, \mathrm{Re}\left\{\frac{\hat{\Gamma}_n - \hat{\Gamma}_x}{\hat{\Gamma}_x - \hat{\Gamma}_s}\right\}\right). \tag{18}$$

While this estimator is unbiased, it was found to be very
sensitive towards phase deviations of the coherence estimate
from the ideal model [13]. For a measured coherence with a
magnitude close to one, even a small phase difference between
$\hat{\Gamma}_x$ and $\hat{\Gamma}_s$ can have a large effect on the CDR estimate. This
can be seen in Fig. 1b, where, unlike in Fig. 1a, the CDR for
coherence values close to the unit circle sharply drops to zero,
and is shown in more detail later.
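
The sensitivity to phase deviations can be illustrated with a short experiment. The Python sketch below assumes the estimator form max(0, Re{(Γ̂_n - Γ̂_x) / (Γ̂_x - Γ̂_s)}); the scenario (broadside source, true CDR of 10, a phase error of 0.3 rad, d = 8 cm, f = 1 kHz, c = 343 m/s) is chosen purely for illustration.

```python
import cmath
import math


def mixture_coherence(cdr, gamma_s, gamma_n):
    """Model coherence of a coherent signal in noise at a given CDR."""
    return (cdr * gamma_s + gamma_n) / (cdr + 1.0)


def cdr_thiergart1(gamma_x, gamma_s, gamma_n):
    """Real part of the direct model inversion, floored at zero."""
    if gamma_x == gamma_s:
        return math.inf  # guard added for this sketch
    return max(0.0, ((gamma_n - gamma_x) / (gamma_x - gamma_s)).real)


kd = 2.0 * math.pi * 1000.0 * 0.08 / 343.0
gamma_n = math.sin(kd) / kd
gamma_s = complex(1.0, 0.0)  # broadside source

on_line = mixture_coherence(10.0, gamma_s, gamma_n)  # high CDR, |coherence| near 1
rotated = on_line * cmath.exp(0.3j)                  # assumed phase error of 0.3 rad

exact = cdr_thiergart1(on_line, gamma_s, gamma_n)
perturbed = cdr_thiergart1(rotated, gamma_s, gamma_n)
```

In this scenario the unperturbed coherence yields the true CDR of 10, whereas the rotated coherence yields an estimate of zero, which corresponds to the sharp drop near the unit circle described above.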

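Based on (17), an unbiased CDR estimator can be
formulated [15]. The diffuse coherence model is first corrected to
account for the phase rotation of the coherence estimate by
multiplying the diffuse noise coherence $\hat{\Gamma}_n$ with the phase term
$e^{-j 2\pi f \widehat{\Delta t}}$ as well, which removes the bias of the estimator,
while preserving the robust properties of (17) against phase
errors (see Fig. 1c):

$$\widehat{\mathrm{CDR}}_{\mathrm{prop1}}(l,f) = \max\left(0, \frac{\mathrm{Re}\{e^{-j 2\pi f \widehat{\Delta t}}\,\hat{\Gamma}_n\} - \mathrm{Re}\{e^{-j 2\pi f \widehat{\Delta t}}\,\hat{\Gamma}_x\}}{\mathrm{Re}\{e^{-j 2\pi f \widehat{\Delta t}}\,\hat{\Gamma}_x\} - 1}\right) = \max\left(0, \frac{\mathrm{Re}\{\hat{\Gamma}_s^* (\hat{\Gamma}_n - \hat{\Gamma}_x)\}}{\mathrm{Re}\{\hat{\Gamma}_s^* \hat{\Gamma}_x\} - 1}\right). \tag{19}$$

This estimator is identical to (17) for $\hat{\Gamma}_s = 1$, i.e., $\Delta t = 0$.
Note that an equivalent CDR estimate can be derived from
the maximum likelihood noise variance estimator which was
proposed in [31] and applied to noise reduction in [32].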
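
Unbiasedness of this variant can be checked by feeding it coherence values that lie exactly on the model line. The Python sketch below assumes the form max(0, Re{Γ̂_s* (Γ̂_n - Γ̂_x)} / (Re{Γ̂_s* Γ̂_x} - 1)) from Table I; parameter values (d = 8 cm, f = 1 kHz, c = 343 m/s, TDOA d/(2c)) and the guard for fully coherent input are assumptions for illustration.

```python
import cmath
import math


def mixture_coherence(cdr, gamma_s, gamma_n):
    """Model coherence of a coherent signal in noise at a given CDR."""
    return (cdr * gamma_s + gamma_n) / (cdr + 1.0)


def cdr_prop1(gamma_x, gamma_s, gamma_n):
    """Time-aligned estimator with the noise coherence rotated as well."""
    aligned_x = (gamma_s.conjugate() * gamma_x).real
    aligned_n = (gamma_s.conjugate() * gamma_n).real
    if aligned_x >= 1.0:
        return math.inf  # fully coherent input (guard added for this sketch)
    return max(0.0, (aligned_n - aligned_x) / (aligned_x - 1.0))


kd = 2.0 * math.pi * 1000.0 * 0.08 / 343.0
gamma_n = math.sin(kd) / kd
gamma_s = cmath.exp(1j * kd / 2.0)  # TDOA of d/(2c)

true_cdrs = [0.1, 1.0, 10.0]
estimates = [
    cdr_prop1(mixture_coherence(c, gamma_s, gamma_n), gamma_s, gamma_n)
    for c in true_cdrs
]
```

The entries of `estimates` match `true_cdrs` up to rounding. For a broadside source with a true CDR of 10 and a 0.3 rad phase error, the same function still returns a finite positive value rather than collapsing to zero, consistent with the robustness argument above.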

For a second, heuristically motivated variant of an unbiased
estimator, the real part in the numerator of (19) and the max
operator are first replaced by the magnitude of the entire term.
The resulting estimator was found to lead to an increased
performance for the application to dereverberation [33]:

$$\widehat{\mathrm{CDR}}_{\mathrm{prop2,uncomp}}(l,f) = \left|\frac{\hat{\Gamma}_s^* (\hat{\Gamma}_n - \hat{\Gamma}_x)}{\mathrm{Re}\{\hat{\Gamma}_s^* \hat{\Gamma}_x\} - 1}\right|. \tag{20}$$

This estimator however has a small bias for non-zero TDOAs;
a correction term for this bias can be computed by inserting
(9) into (20) and solving for the ratio
$\mathrm{CDR} / \widehat{\mathrm{CDR}}_{\mathrm{prop2,uncomp}}$. The bias-compensated
estimator is then given by

$$\widehat{\mathrm{CDR}}_{\mathrm{prop2}}(l,f) = \frac{1 - \hat{\Gamma}_n \cos(\arg \hat{\Gamma}_s)}{|\hat{\Gamma}_n - \hat{\Gamma}_s|}\;\widehat{\mathrm{CDR}}_{\mathrm{prop2,uncomp}}(l,f), \tag{21}$$

and is illustrated in Fig. 1d. Compensation of this small bias
however only has a negligible effect on practical performance.
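
The size of this bias and the effect of the correction term can be examined numerically. The Python sketch below assumes the magnitude-based form of Table I, i.e. the factor (1 - Γ̂_n cos(arg Γ̂_s)) / |Γ̂_n - Γ̂_s| applied to |Γ̂_s* (Γ̂_n - Γ̂_x) / (Re{Γ̂_s* Γ̂_x} - 1)|; the `compensate` flag, the guard, and the parameter values (d = 8 cm, f = 1 kHz, c = 343 m/s, TDOA d/(2c)) are assumptions for illustration.

```python
import cmath
import math


def mixture_coherence(cdr, gamma_s, gamma_n):
    """Model coherence of a coherent signal in noise at a given CDR."""
    return (cdr * gamma_s + gamma_n) / (cdr + 1.0)


def cdr_prop2(gamma_x, gamma_s, gamma_n, compensate=True):
    """Magnitude-based estimator, optionally with the bias correction."""
    aligned = (gamma_s.conjugate() * gamma_x).real
    if aligned >= 1.0:
        return math.inf  # fully coherent input (guard added for this sketch)
    estimate = abs(gamma_s.conjugate() * (gamma_n - gamma_x) / (aligned - 1.0))
    if compensate:
        # Correction factor; undefined only if gamma_n equals gamma_s.
        estimate *= ((1.0 - gamma_n * math.cos(cmath.phase(gamma_s)))
                     / abs(gamma_n - gamma_s))
    return estimate


kd = 2.0 * math.pi * 1000.0 * 0.08 / 343.0
gamma_n = math.sin(kd) / kd
gamma_s = cmath.exp(1j * kd / 2.0)  # TDOA of d/(2c)

true_cdrs = [0.1, 1.0, 10.0]
observations = [mixture_coherence(c, gamma_s, gamma_n) for c in true_cdrs]
compensated = [cdr_prop2(x, gamma_s, gamma_n) for x in observations]
bias_factors = [
    cdr_prop2(x, gamma_s, gamma_n, compensate=False) / c
    for x, c in zip(observations, true_cdrs)
]
```

The compensated estimates reproduce the true values, and the uncompensated estimates are off by one common factor that does not depend on the CDR, which is why a single multiplicative correction suffices.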


The derivation of these estimators shows that, when both
knowledge of the signal and noise coherence are available,
several different unbiased CDR estimators can be implemented.
The reason for this is that the requirement of unbiasedness
only defines the behavior of the estimator for coherence values
matching the model given by (9), i.e., the values on the
line in Fig. 1, while allowing arbitrary behavior for other
coherence values. While the second proposed unbiased variant
has significant practical advantages, as shown in the qualitative
analysis of the estimator behavior in Section IV-E and the
signal-based evaluation in Section VI, it does not seem to be
optimal in any sense. A possible direction for future work
would therefore be to establish a statistical model for the
deviations of $\hat{\Gamma}_x$ from the theoretical model given by (9), and
derive a correspondingly optimized unbiased estimator.


C. CDR Estimation for Unknown DOA 

The previously shown methods rely on prior knowledge or
an estimate of the target DOA. As an alternative, Thiergart et
al. [11], [13] proposed to use the instantaneous phase of the
estimated cross-power spectrum $\hat{\Phi}_{x_1 x_2}$ as a phase estimate for
the direct signal model, i.e., $\hat{\Gamma}_s = e^{j \arg \hat{\Phi}_{x_1 x_2}}$, thus removing
the need for explicit DOA estimation to obtain $\hat{\Gamma}_s$. Since,
according to (14), $\arg \hat{\Gamma}_x = \arg \hat{\Phi}_{x_1 x_2}$, this estimator can be
formulated as a function of only the coherence estimate $\hat{\Gamma}_x$
and the noise coherence $\hat{\Gamma}_n$:

$$\widehat{\mathrm{CDR}}_{\mathrm{Thiergart2}}(l,f) = \max\left(0, \mathrm{Re}\left\{\frac{\hat{\Gamma}_n - \hat{\Gamma}_x}{\hat{\Gamma}_x - e^{j \arg \hat{\Gamma}_x}}\right\}\right). \tag{22}$$

However, the instantaneous phase of the mixture is not an
unbiased estimate of the phase of the direct signal component,
since, for low CDR values, the coherence of the mixture is
dominated by the coherence of the diffuse signal component
[13], which is real-valued, i.e., has a phase of zero. For $\theta \neq 0°$,
the estimator is therefore biased. The behavior of the estimator
is illustrated in Fig. 1e.
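
The effect of using the mixture phase as the signal model can be checked numerically. The Python sketch below assumes the estimator form max(0, Re{(Γ̂_n - Γ̂_x) / (Γ̂_x - e^{j arg Γ̂_x})}) from Table I; the parameter values (d = 8 cm, f = 1 kHz, c = 343 m/s, TDOA d/(2c)) and the guard for a coherence on the unit circle are assumptions.

```python
import cmath
import math


def mixture_coherence(cdr, gamma_s, gamma_n):
    """Model coherence of a coherent signal in noise at a given CDR."""
    return (cdr * gamma_s + gamma_n) / (cdr + 1.0)


def cdr_thiergart2(gamma_x, gamma_n):
    """Direct inversion with the phase of the mixture as signal model."""
    gamma_s_hat = cmath.exp(1j * cmath.phase(gamma_x))
    if gamma_x == gamma_s_hat:
        return math.inf  # guard added for this sketch
    return max(0.0, ((gamma_n - gamma_x) / (gamma_x - gamma_s_hat)).real)


kd = 2.0 * math.pi * 1000.0 * 0.08 / 343.0
gamma_n = math.sin(kd) / kd
gamma_s = cmath.exp(1j * kd / 2.0)  # TDOA of d/(2c)

true_cdrs = [0.1, 1.0, 10.0]
estimates = [
    cdr_thiergart2(mixture_coherence(c, gamma_s, gamma_n), gamma_n)
    for c in true_cdrs
]
broadside_estimate = cdr_thiergart2(
    mixture_coherence(1.0, complex(1.0, 0.0), gamma_n), gamma_n)
```

For the off-broadside source every entry of `estimates` lies below the corresponding true CDR, with the largest relative error at low CDR, where the mixture phase is pulled towards zero by the diffuse component. For a broadside source the mixture phase is correct and the estimate is exact.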

As shown in [15], it is possible to derive an unbiased CDR
estimator which does not require an estimate of the source
DOA, since the knowledge that $|\Gamma_s| = 1$, i.e., that the direct
signal is fully coherent, is sufficient to solve (15). This can
be explained using a geometric interpretation: according to
(9), $\Gamma_x$, $\Gamma_s$ and $\Gamma_n$ all lie on a straight line in the complex
plane, and it is furthermore known that $\Gamma_s$ lies on the unit
circle and $\Gamma_n$ on the real axis. $\hat{\Gamma}_s$ can therefore be obtained
by the intersection of the line through $\hat{\Gamma}_n$ and $\hat{\Gamma}_x$ with the unit
circle, and inserted into (15). An alternative way of obtaining
this solution is by solving (9) for $\Gamma_s$ and setting the magnitude
to 1:

$$|\Gamma_s| = \left|\Gamma_x - (\Gamma_n - \Gamma_x)\,\mathrm{CDR}^{-1}\right| = 1, \tag{23}$$

which leads to a quadratic equation for the CDR:

$$(|\Gamma_x|^2 - 1)\,\mathrm{CDR}^2 - 2\,\mathrm{Re}\{\Gamma_x (\Gamma_n - \Gamma_x)^*\}\,\mathrm{CDR} + |\Gamma_n - \Gamma_x|^2 = 0. \tag{24}$$

Taking the positive one of the two possible solutions yields the
unbiased DOA-independent CDR estimator

$$\widehat{\mathrm{CDR}}_{\mathrm{prop3}}(l,f) = \frac{\hat{\Gamma}_n \mathrm{Re}\{\hat{\Gamma}_x\} - |\hat{\Gamma}_x|^2 - \sqrt{\hat{\Gamma}_n^2\,\mathrm{Re}\{\hat{\Gamma}_x\}^2 - \hat{\Gamma}_n^2 |\hat{\Gamma}_x|^2 + \hat{\Gamma}_n^2 - 2 \hat{\Gamma}_n \mathrm{Re}\{\hat{\Gamma}_x\} + |\hat{\Gamma}_x|^2}}{|\hat{\Gamma}_x|^2 - 1}, \tag{25}$$

which is illustrated in Fig. 1f. In contrast to the
DOA-dependent estimators, where an infinite number of unbiased
estimators exists, the DOA-independent estimator is uniquely
determined by the requirement of unbiasedness.
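
The closed-form solution of the quadratic equation translates directly into code. The Python sketch below assumes the estimator is the root (B - sqrt(B² - AC)) / A of (24) with A = |Γ̂_x|² - 1, B = Γ̂_n Re{Γ̂_x} - |Γ̂_x|² and C = |Γ̂_n - Γ̂_x|², written out for a real-valued Γ̂_n; the clipping of the radicand at zero, the handling of |Γ̂_x| ≥ 1 and the parameter values are assumptions added for illustration.

```python
import cmath
import math


def mixture_coherence(cdr, gamma_s, gamma_n):
    """Model coherence of a coherent signal in noise at a given CDR."""
    return (cdr * gamma_s + gamma_n) / (cdr + 1.0)


def cdr_prop3(gamma_x, gamma_n):
    """DOA-independent estimate from the coherence and a real noise model."""
    re_x = gamma_x.real
    mag2 = abs(gamma_x) ** 2
    if mag2 >= 1.0:
        return math.inf  # on or outside the unit circle (guard, assumed)
    radicand = (gamma_n ** 2 * re_x ** 2 - gamma_n ** 2 * mag2
                + gamma_n ** 2 - 2.0 * gamma_n * re_x + mag2)
    root = math.sqrt(max(0.0, radicand))  # clip tiny negative rounding errors
    return (gamma_n * re_x - mag2 - root) / (mag2 - 1.0)


kd = 2.0 * math.pi * 1000.0 * 0.08 / 343.0
gamma_n = math.sin(kd) / kd

true_cdrs = [0.1, 1.0, 10.0]
# Two different source directions; the estimator is told neither of them.
directions = [complex(1.0, 0.0), cmath.exp(1j * kd / 2.0)]
estimates = [
    [cdr_prop3(mixture_coherence(c, g, gamma_n), gamma_n) for c in true_cdrs]
    for g in directions
]
```

Both rows of `estimates` reproduce `true_cdrs` up to rounding, although the source direction is never passed to `cdr_prop3`.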


D. CDR Estimation for Unknown Noise Coherence 

From the geometric interpretation of the coherence of mixed
sound fields it can be analogously concluded that knowledge
of $\Gamma_n$ is not required when $\Gamma_s$ is known, since the noise
coherence is assumed to be real and therefore determined by
the intersection of the real axis and the line through $\Gamma_s$ and
$\Gamma_x$. Using $\mathrm{Im}\{\Gamma_n\} = 0$, $\Gamma_n$ can therefore be eliminated from
(15), resulting in

$$\mathrm{CDR} = \frac{\mathrm{Im}\{\Gamma_x\}}{\mathrm{Im}\{\Gamma_s\} - \mathrm{Im}\{\Gamma_x\}}. \tag{26}$$

When using this formulation with the estimates $\hat{\Gamma}_x$ and $\hat{\Gamma}_s$
as an estimator for the CDR, practical problems occur in
cases where, due to model mismatch and coherence estimation
errors, the imaginary part of the coherence estimate $\mathrm{Im}\{\hat{\Gamma}_x\}$
has either values with a larger magnitude than $\mathrm{Im}\{\hat{\Gamma}_s\}$, or a
different sign, in which case this equation would not yield
a meaningful result. For this reason, the CDR estimate is
continuously extended into these two problematic regions by
returning an infinite CDR in the former case, and a CDR of
zero in the latter case. The final proposed estimator is then
given by

$$\widehat{\mathrm{CDR}}_{\mathrm{prop4}}(l,f) = \begin{cases} \dfrac{\mathrm{Im}\{\hat{\Gamma}_x\}}{\mathrm{Im}\{\hat{\Gamma}_s\} - \mathrm{Im}\{\hat{\Gamma}_x\}} & \text{for } 0 < \dfrac{\mathrm{Im}\{\hat{\Gamma}_x\}}{\mathrm{Im}\{\hat{\Gamma}_s\}} < 1, \\[2ex] \infty & \text{for } \dfrac{\mathrm{Im}\{\hat{\Gamma}_x\}}{\mathrm{Im}\{\hat{\Gamma}_s\}} \geq 1, \\[2ex] 0 & \text{for } \dfrac{\mathrm{Im}\{\hat{\Gamma}_x\}}{\mathrm{Im}\{\hat{\Gamma}_s\}} \leq 0. \end{cases} \tag{27}$$

An inherent constraint that limits practical applicability of this
estimator is that $\arg \hat{\Gamma}_s \neq 0$, since otherwise the imaginary
parts disappear; i.e., the estimator is not usable for $\Delta t = 0$,
and increasingly sensitive towards estimation errors for small
TDOAs. The estimator is visualized in Fig. 1g. Note that in
[34] a noise power spectrum estimate was derived in a similar
way from the imaginary part of a cross-power spectrum.
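
The case distinction of this estimator is easy to get wrong in an implementation, so a compact sketch may help. The Python code below assumes the three cases are selected by the ratio Im{Γ̂_x} / Im{Γ̂_s} (between 0 and 1: the ratio formula; at least 1: infinite CDR; at most 0: zero CDR); raising an exception for a TDOA of zero and the parameter values are choices made for this illustration.

```python
import cmath
import math


def mixture_coherence(cdr, gamma_s, gamma_n):
    """Model coherence of a coherent signal in noise at a given CDR."""
    return (cdr * gamma_s + gamma_n) / (cdr + 1.0)


def cdr_prop4(gamma_x, gamma_s):
    """Noise-coherence-independent estimate from imaginary parts only."""
    im_s = gamma_s.imag
    if im_s == 0.0:
        raise ValueError("estimator is undefined for a TDOA of zero")
    ratio = gamma_x.imag / im_s
    if ratio <= 0.0:
        return 0.0       # imaginary parts have different signs
    if ratio >= 1.0:
        return math.inf  # |Im| of the estimate exceeds that of the model
    return gamma_x.imag / (im_s - gamma_x.imag)


kd = 2.0 * math.pi * 1000.0 * 0.08 / 343.0
gamma_n = math.sin(kd) / kd
gamma_s = cmath.exp(1j * kd / 2.0)  # TDOA of d/(2c)

true_cdrs = [0.1, 1.0, 10.0]
estimates = [
    cdr_prop4(mixture_coherence(c, gamma_s, gamma_n), gamma_s)
    for c in true_cdrs
]
```

The noise coherence `gamma_n` is used only to synthesize the test mixtures; `cdr_prop4` itself never sees it and still recovers the true CDR values.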


E. Evaluation of Estimator Bias and Robustness 

To illustrate the bias of the estimators $\widehat{\mathrm{CDR}}_{\mathrm{Jeub}}$, the
uncompensated estimator $\widehat{\mathrm{CDR}}_{\mathrm{prop2,uncomp}}$ and $\widehat{\mathrm{CDR}}_{\mathrm{Thiergart2}}$, Fig. 2
compares the true CDR value and the different estimates for
mixtures of coherent and ideally diffuse signals for a TDOA
$\Delta t = d/(2c)$ (corresponding to the values along the white line in
Fig. 1, second row). The proposed estimators are all unbiased,
as is the DOA-dependent estimator proposed by Thiergart et
al. (18). The estimator by Jeub et al. (17) and the
DOA-independent estimator by Thiergart et al. (22) both have a
significant bias, with the former under- or overestimating
the CDR depending on the values of $\Delta t$ and $f$, and the
latter always underestimating the CDR. Also shown is the
uncompensated version of the proposed estimator 2 (20),
which has a small, TDOA- and frequency-dependent bias (for
$f = 3$ kHz, the difference to the unbiased case is too small to
be noticeable in the plot).

Fig. 3 shows the CDR estimation error for cases where
the actual coherence of the noise $\Gamma_n$ or the direct signal
component $\Gamma_s$ deviates from the assumed coherence models
$\hat{\Gamma}_n$ and $\hat{\Gamma}_s$, respectively. Fig. 3a and b show the error for a
low CDR of $-10$ dB, while c and d show results for a high
CDR of $10$ dB. The DOA-independent estimator $\widehat{\mathrm{CDR}}_{\mathrm{prop3}}$ is
naturally unaffected by the phase error of the direct signal
coherence model, as seen in Fig. 3b and d; however, for errors
of the noise coherence, the CDR is quickly overestimated by
the DOA-independent estimator (see Fig. 3a). The estimator
$\widehat{\mathrm{CDR}}_{\mathrm{Thiergart1}}$ has the problem of reacting strongly to small
phase deviations when the CDR is high (see Fig. 3d).
Comparing the different unbiased DOA-dependent variants $\widehat{\mathrm{CDR}}_{\mathrm{prop1}}$
and $\widehat{\mathrm{CDR}}_{\mathrm{prop2}}$, it can be stated that $\widehat{\mathrm{CDR}}_{\mathrm{prop2}}$ seems slightly
more tolerant towards model errors, which could explain the
better performance of this estimator for signal enhancement.

Figure 2. Comparison of true CDR and estimated CDR. Parameters $d = 8$ cm,
$\Delta t = \widehat{\Delta t} = d/(2c)$, $f = 1$ kHz (left), $3$ kHz (right).
Panels: (a) $f = 1$ kHz; (b) $f = 3$ kHz. Axis: $10 \log_{10} \mathrm{CDR}$ [dB]. Legend: proposed 2 (uncompensated); unbiased (Thiergart 1, proposed 1, 2, 3, 4); Thiergart 2 DOA-ind.; Jeub.

Figure 3. CDR estimation error for noise and direct signal coherence model
errors. Parameters $d = 8$ cm, $\Delta t = d/(2c)$, $f = 1$ kHz.
Panels: (a), (b) CDR $= -10$ dB; (c), (d) CDR $= 10$ dB. Horizontal axes: $\hat{\Gamma}_n - \Gamma_n$ for (a), (c); $\arg \hat{\Gamma}_s - \arg \Gamma_s$ [rad] for (b), (d).

Figure 4. Coherence-based noise and reverberation suppression system
consisting of a preprocessor and a CDR-based postfilter.


V. Application to Speech Enhancement 


Fig. 4 shows the structure of the proposed reverberation or 
diffuse noise suppression system based on short-time CDR 
estimates. First, the microphone signals are combined by 
averaging the squared magnitudes and using the phase from 
one of the microphone signals: 


Y{l,f ) =-v/|Xi(l,/)|2 + \X2{l,fW ■ (28) 


Spatial magnitude averaging in the STFT domain is typically
used to reduce the variance of spectral estimates for the
computation of microphone array postfilters [9], but has also
been used as a preprocessor for signal enhancement [35]. Here, it
is used to reduce the variations in
the transfer function which are caused by constructive and
destructive interference of early reflection components with
the direct path. For the computation of the coherence-based
postfilter gain $G(l,f)$, short-time estimates of the
spatial coherence are first obtained according to (14) from
spectra which have been estimated by recursive averaging.
From the coherence, the CDR is estimated based on models
for the direct signal and/or reverberation coherence, where the
direct signal coherence is derived from a known or estimated
TDOA, and the reverberation coherence is assumed to be
known. A postfilter gain is then computed using spectral
magnitude subtraction [29]:


$$G(l,f) = \max\left\{ G_{\min},\; 1 - \sqrt{\frac{\mu}{\widehat{CDR}(l,f) + 1}} \right\}, \qquad (29)$$

with the oversubtraction factor $\mu$ and the gain floor $G_{\min}$. The
output signal is computed by applying the postfilter gain to
the preprocessed signal $Y(l,f)$, i.e., $Z(l,f) = G(l,f)\,Y(l,f)$,
and transformed back into the time domain. Since the preprocessor
does not have any spatial filtering effect, the postfilter
gain can be applied directly to the preprocessor output and
does not require a correction to account for spatial filtering,
as would be the case with a beamformer as preprocessor [8].

Note that, when employing a DOA-independent CDR estimator,
the proposed signal enhancement system is completely
independent of the DOA of the target signal.
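The processing of a single time-frequency bin described in this section can be sketched as follows. This is only an illustration, not the authors' MATLAB implementation: it assumes that (28) takes the root-mean-square magnitude of both channels with the phase of the first channel, and that (29) has the form $G = \max\{G_{\min}, 1 - \sqrt{\mu/(\widehat{CDR}+1)}\}$. The function names and the clipping of negative CDR estimates are assumptions made for this sketch.

```python
import cmath
import math


def spatial_magnitude_average(x1: complex, x2: complex) -> complex:
    """Preprocessor for one STFT bin: RMS of both magnitudes, phase of x1.

    Using channel 1 as the phase reference is an assumption of this sketch.
    """
    magnitude = math.sqrt(0.5 * (abs(x1) ** 2 + abs(x2) ** 2))
    return cmath.rect(magnitude, cmath.phase(x1))


def postfilter_gain(cdr: float, mu: float = 1.3, g_min: float = 0.1) -> float:
    """Spectral magnitude subtraction gain for a linear-scale CDR estimate."""
    cdr = max(cdr, 0.0)  # assumption: clip negative CDR estimates to zero
    return max(g_min, 1.0 - math.sqrt(mu / (cdr + 1.0)))


def enhance_bin(x1: complex, x2: complex, cdr: float) -> complex:
    """Output bin Z = G * Y for one time-frequency point."""
    return postfilter_gain(cdr) * spatial_magnitude_average(x1, x2)
```

With the default values $\mu = 1.3$ and $G_{\min} = 0.1$ used later in the evaluation, a fully diffuse bin (CDR = 0) is attenuated to the gain floor, while the gain approaches 1 for large CDR values.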

VI. Evaluation 

In the following, the spatial properties of reverberation are 
first evaluated using simulated and measured RIRs, in order 
to verify the assumptions made in Sect. III-B. Then, the 
estimation accuracy of the CDR estimators and the effect of the 
proposed CDR-based dereverberation system are evaluated. 

A MATLAB implementation of the proposed CDR estimators
and signal enhancement scheme is provided online (see footnote).

A. Setup and Parameters 

For the main evaluation, sets of measured RIRs from three
rooms are used:

• Room A: 6 m × 6 m × 3 m, partially closed curtains on
walls, $T_{60}$ ≈ 0.4 s

• Room B: 7 m × 11 m × 3 m (lecture hall), $T_{60}$ ≈ 1 s

• Room C: 54 m × 7 m × 3 m (large foyer), $T_{60}$ ≈ 3.5 s

The reverberation time $T_{60}$ was measured from the energy
decay curve of the RIR. In each room, RIRs were measured for
40-70 different source positions at distances of $l$ = 1, 2 and 4 m
from the microphones, in the angular range θ = −90° ... 90°.
The microphones are spaced d = 8 cm apart.

Additionally, the RIRs that were used in the REVERB
challenge [36] for the generation of multi-condition training
data are evaluated. These RIRs were measured using an
8-channel circular microphone array with a diameter of 20 cm
(corresponding to d = 8 cm spacing between neighboring
microphones) in 6 different rooms (SR1/2, MR1/2, LR1/2),
for two source-microphone distances (≈0.5 m and ≈2 m), and
two different angles of the source w.r.t. the microphone array.
The rooms have the following properties (note that SR2 and
LR2 are the same rooms as A and B, respectively):

• SR1 (“Small Room 1”): variable reverberation room,
4.5 m × 3.5 m × 3 m, $T_{60}$ ≈ 0.2 s

• SR2 (“Small Room 2”): room A, but curtains fully closed,
$T_{60}$ ≈ 0.2 s

• MR1 (“Medium Room 1”): same as SR1, $T_{60}$ ≈ 0.5 s

• MR2 (“Medium Room 2”): meeting room, 5 m × 3.5 m ×
3 m, $T_{60}$ ≈ 0.6 s

• LR1 (“Large Room 1”): same as SR1, $T_{60}$ ≈ 0.8 s

• LR2 (“Large Room 2”): room B

In the following, all processing takes place at a sampling
rate of 16 kHz. For the transformation into the time-frequency
domain and short-time spectral estimation, a DFT-based uniform
filterbank with window length 1024, FFT size 512, and
downsampling factor 128 is employed [37]. The short-time
coherence estimates are obtained by recursive averaging of
the auto- and cross-power spectra according to (13), with the
forgetting factor $\lambda$ = 0.68.

Footnote: http://www.lms.lnt.de/files/publications/cdr-dereverb.zip

[Figure 5: coherence plotted over f [kHz]; panels (a) Reflective room (β = 0.9), (b) Absorbing floor and ceiling ($\beta_{\mathrm{walls}}$ = 0.9, $\beta_{\mathrm{floor,ceil}}$ = 0.1), (c) Absorbing walls ($\beta_{\mathrm{walls}}$ = 0.5, $\beta_{\mathrm{floor,ceiling}}$ = 0.9); curves: 3D isotropic, 2D isotropic, measured.]

Figure 5. Spatial coherence estimated from the reverberation tail of simulated
RIRs, averaged over 7 microphone pairs with spacing d = 8 cm, for different
reflection coefficients β, compared to coherence of diffuse and 2D isotropic
sound fields. Left: small room (4 x 3 x 2.5 m), right: large room (15 x 18 x
10 m).

[Figure 6: coherence plotted over f [kHz]; panels (a) Small room 1, (b) Small room 2, (c) Medium room 1, (d) Medium room 2, (e) Large room 1, (f) Large room 2; curves: 3D isotropic, 2D isotropic, measured.]

Figure 6. Spatial coherence estimated from the reverberation tail of measured
RIRs from the REVERB challenge, averaged over 7 microphone pairs with
spacing d = 8 cm.
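The recursive averaging step can be illustrated with a small sketch. Equations (13) and (14) are not reproduced in this section, so the code assumes the common first-order recursion $\hat\Phi(l) = \lambda\,\hat\Phi(l-1) + (1-\lambda)\,X_1 X_2^*$ and the usual normalized cross-power definition of the coherence; the class name, the sign convention of the cross-spectrum, and the helper for the time constant are assumptions of this sketch.

```python
import cmath
import math


class CoherenceTracker:
    """Recursive estimate of the complex spatial coherence for one frequency bin."""

    def __init__(self, forgetting: float = 0.68) -> None:
        self.lam = forgetting
        self.phi_11 = 0.0  # auto-power spectrum, microphone 1
        self.phi_22 = 0.0  # auto-power spectrum, microphone 2
        self.phi_12 = 0j   # cross-power spectrum (assumed convention: X1 * conj(X2))

    def update(self, x1: complex, x2: complex) -> complex:
        """Feed one STFT frame of both channels, return the current coherence."""
        lam = self.lam
        self.phi_11 = lam * self.phi_11 + (1.0 - lam) * abs(x1) ** 2
        self.phi_22 = lam * self.phi_22 + (1.0 - lam) * abs(x2) ** 2
        self.phi_12 = lam * self.phi_12 + (1.0 - lam) * x1 * x2.conjugate()
        denom = math.sqrt(self.phi_11 * self.phi_22)
        return self.phi_12 / denom if denom > 0.0 else 0j


def averaging_time_constant(forgetting: float, hop: int, fs: float) -> float:
    """Time in seconds after which the weight of a past frame has decayed to 1/e."""
    return -(hop / fs) / math.log(forgetting)
```

Under these assumptions, a forgetting factor of 0.68 with a downsampling factor of 128 at 16 kHz corresponds to an averaging time constant of roughly 21 ms, i.e., the coherence estimate follows the speech signal on a short-time basis.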

B. Spatial Properties of Reverberation in Simulated and Measured Rooms

For the evaluation of the spatial characteristics of reverberation,
we use simulated and measured RIRs. The reverberation
tail of the RIRs is extracted by removing the initial part
containing the direct path and early reflections (see Appendix),
using a typical value of $T_e$ = 50 ms for the cutoff time between
early reflections and reverberation [20]. The late RIRs are
convolved with a speech signal, transformed into the STFT
domain, and the spatial coherence is estimated from auto- and
cross-power spectra estimated by averaging over an interval
of 10 s.

First, RIRs are generated using the image method [25], [38].
In the simulations, a uniform linear array (inter-microphone
spacing d = 8 cm) is placed horizontally in the center of
rectangular rooms with varying dimensions and reflectivities.
The image source order is chosen sufficiently high to include
all reflections within 60 dB of the main peak. In order to reduce
the variance of the estimate for a better visualization, the
coherence is also spatially averaged over the estimates from
7 microphone pairs [24]. Fig. 5 shows plots of the real part
of the resulting coherence, for a large room (15 x 18 x 10 m,
left) and a small room (4 x 3 x 2.5 m, right); for both rooms,
three configurations for the surface reflectivity β are used:
equally high reflectivity for all surfaces (β = 0.9), highly
absorbing floor and ceiling ($\beta_{\mathrm{walls}}$ = 0.9, $\beta_{\mathrm{floor,ceil}}$ = 0.1), and
moderately absorbing walls ($\beta_{\mathrm{walls}}$ = 0.5, $\beta_{\mathrm{floor,ceil}}$ = 0.9).
The results in Fig. 5 confirm the assumptions on the coherence
properties of reverberation that were made in Section III-B:
for equal reflectivity of all surfaces, the coherence closely
matches the coherence of the diffuse sound field. If floor and
ceiling are highly absorbing, the model of a 2D isotropic sound
field is appropriate. If instead the walls are more absorbing
than floor and ceiling, the coherence is significantly higher
than the diffuse coherence, since the dominating vertically
propagating components are strongly correlated between the
horizontally spaced microphones. Also, the variance of the
coherence estimate is visibly lower in the larger room.

Fig. 6 shows the reverberation coherence estimates obtained
from the RIRs of the REVERB challenge database, estimated
in the same way as for the simulated RIRs. The coherence
estimates are obtained for 7 pairs of neighboring microphones
from the circular array and averaged. Most rooms match the
diffuse model quite well, with two exceptions. In SR2, the
coherence is higher than expected from the diffuse model,
which can be explained by the presence of absorbing curtains
on all four walls. In MR2, however, the coherence almost
perfectly matches the 2D isotropic model, since in this room
the walls are more reflective than floor and ceiling. Also, it can
again be observed that the variance of the coherence estimate
is lower for rooms with a longer reverberation time.

[Figure 7: coherence plotted over f [kHz]; panels (a) Room A, (b) Room B, (c) Room C; curves: 3D isotropic, 2D isotropic, measured.]

Figure 7. Spatial coherence estimated from the reverberation tail of measured
RIRs in rooms A, B, C, one microphone pair with spacing d = 8 cm.

Fig. 7 shows the results for one position in the rooms A,
B and C. The coherence estimate is here computed from just
one pair of microphones; therefore, the variance is significantly
higher. The diffuse model is a good fit for rooms B and C,
where all surfaces are highly reflective. In room A, the
coherence is similar to the simulated case of partially absorbing
walls, which is due to the presence of partially closed curtains
on the walls of the room.

Concluding the analysis of the spatial properties, it can
be stated that, for microphones located in the same horizontal
plane, the spatial coherence of reverberation in real
rooms typically lies between the coherence of diffuse and 2D
isotropic noise, with some exceptions where the coherence
is increased due to dominant vertical reflections. The diffuse
model is a good fit for most rooms, unless there are large
differences in the reflectivity of the room surfaces. Finally, it
is noteworthy that the image source model with sufficient order
can reproduce the spatial characteristics of late reverberation
which are observed in real rooms.
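For reference, the two model curves against which the measured coherence is compared can be computed as sketched below. The sketch assumes the standard textbook expressions for omnidirectional microphones, namely $\sin(x)/x$ for the diffuse (3D isotropic) field and the Bessel function $J_0(x)$ for the 2D isotropic field, with $x = 2\pi f d / c$. The speed of sound of 343 m/s and the function names are assumptions, and $J_0$ is evaluated numerically from its integral representation to keep the example dependency-free.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not taken from the paper


def diffuse_coherence(f_hz: float, d_m: float, c: float = SPEED_OF_SOUND) -> float:
    """Coherence of a 3D isotropic (diffuse) field: sin(x)/x, x = 2*pi*f*d/c."""
    x = 2.0 * math.pi * f_hz * d_m / c
    return 1.0 if x == 0.0 else math.sin(x) / x


def isotropic_2d_coherence(f_hz: float, d_m: float, c: float = SPEED_OF_SOUND,
                           steps: int = 2000) -> float:
    """Coherence of a 2D isotropic field: J0(x), x = 2*pi*f*d/c.

    J0(x) = (1/pi) * integral_0^pi cos(x*sin(t)) dt, evaluated by the midpoint rule.
    """
    x = 2.0 * math.pi * f_hz * d_m / c
    h = math.pi / steps
    return sum(math.cos(x * math.sin((k + 0.5) * h)) for k in range(steps)) / steps
```

For the spacing d = 8 cm used throughout the evaluation, the first zero of the diffuse model lies at $f = c/(2d)$, i.e., slightly above 2 kHz for the assumed speed of sound.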

C. CDR Estimation for Reverberant Speech 

In Section II, a reverberant speech signal is modeled as
consisting of a directional and a diffuse component, which
are mutually uncorrelated. In practice, the reverberant sound
field consists of the direct path, several spatially distinct early
reflections, and the reverberation component, all of which are
not perfectly uncorrelated, due to the non-zero length of the
observation window and the temporal correlation of speech
signals. In the previous section, it was shown that the model
of a diffuse sound field is appropriate for the reverberation
component. In the following, it is investigated whether the
simplified model of a mixture of uncorrelated directional
and diffuse sound fields can be applied to real reverberant
speech signals, i.e., whether the CDR estimate can be used
as a practical measure for the time- and frequency-dependent
ratio between desired and undesired signal components, as
required for speech enhancement. We now consider the
desired signal components to be the direct path plus the
reflections arriving within $T_e$ = 50 ms after the direct path,
and the undesired components to be the energy caused by
the reverberation tail of the RIR. This is motivated by the
well-known effect that early reflections are beneficial both
for speech intelligibility [39] and ASR accuracy [40], and
should therefore be considered part of the desired signal.
In other words, the relevant SNR to be estimated for the
application to signal enhancement is the early-to-late power
ratio $ELR_{50\,\mathrm{ms}}(l,f)$ (see Appendix).

To illustrate by example the relationship between the
(non-stationary) early-to-late power ratio and the short-time
coherence estimate, the time-frequency bins of a reverberant
speech signal are first classified according to the instantaneous
$ELR_{50\,\mathrm{ms}}$ into low-reverberant and highly reverberant, and the
corresponding distribution of the short-time estimates of the
complex coherence is visualized as a histogram. Fig. 8 shows
the two-dimensional histograms of the complex coherence of
bins with ELR > 10 dB (left) and ELR < −10 dB (right)
around f = 1 kHz. The coherence of the low-reverberant
bins matches the coherence of a single plane wave quite
well, although the signal contains contributions from early
reflections in addition to the direct path. The phase has a slight
spread, caused by early reflections; this has to be tolerated by
the CDR estimator. The coherence of the highly reverberant
bins, which should lie close to the diffuse model coherence,
has a considerably higher spread and is not exactly centered
around the model. This indicates that, while the simplified
model seems to be reasonable, errors are non-negligible, and
the differences in the realizations of the unbiased estimators,
which affect only the behavior for values deviating from
the ideal model, are likely to have a significant impact on
estimation performance.
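The plane-wave reference point against which the low-reverberant bins are compared follows directly from the TDOA. A minimal sketch, assuming a far-field source, a DOA θ measured from broadside (so that θ = 0 gives zero TDOA), a speed of sound of 343 m/s, and the sign convention $\Gamma_s(f) = e^{j 2\pi f \Delta t}$; the function names and the sign convention are assumptions of this sketch:

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value


def tdoa(d_m: float, theta_deg: float, c: float = SPEED_OF_SOUND) -> float:
    """Far-field time difference of arrival for spacing d and DOA theta (0 = broadside)."""
    return d_m * math.sin(math.radians(theta_deg)) / c


def plane_wave_coherence(f_hz: float, delta_t: float) -> complex:
    """Coherence of a single plane wave: unit magnitude, phase 2*pi*f*delta_t."""
    return cmath.exp(2j * math.pi * f_hz * delta_t)


# Scenario comparable to the histogram example: d = 8 cm, theta = 60 degrees, f = 1 kHz
gamma_s = plane_wave_coherence(1000.0, tdoa(0.08, 60.0))
```

The model coherence always lies on the unit circle; only its phase depends on frequency and TDOA, which is why phase deviations caused by early reflections directly affect DOA-dependent estimators.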

For the comparison of the estimation performance of the
different estimators, it is convenient to transform the true
and estimated CDR into the true and estimated diffuseness
$D = [CDR + 1]^{-1}$ and $\hat{D} = [\widehat{CDR} + 1]^{-1}$, respectively,
since the diffuseness is bounded between 0 and 1, and to
evaluate the mean squared error $MSE = E\{|\hat{D} - D|^2\}$. For this
evaluation, the true CDR is again approximated by the ELR
($CDR \approx ELR_{50\,\mathrm{ms}}$), and the expectation is approximated by
averaging over time and frequency. The coherence models $\Gamma_s$
and $\Gamma_n$ for the estimators are based on the measured TDOA
and the diffuse coherence assumption, respectively. Table II
shows the MSE for the different estimators, averaged over
all source positions in the respective room. The estimator
$\widehat{CDR}_{\mathrm{Thiergart1}}$ has a relatively high estimation error, due to
the high sensitivity of this estimator towards phase variation
of the coherence. The estimator $\widehat{CDR}_{\mathrm{prop1}}$ shows a slightly
reduced estimation error compared to the biased estimator
$\widehat{CDR}_{\mathrm{Jeub}}$, while the variant $\widehat{CDR}_{\mathrm{prop2}}$ further reduces the
error. Among the DOA-independent estimators, the proposed
unbiased version leads to an error reduction as well, while the
noise coherence-independent variant $\widehat{CDR}_{\mathrm{prop4}}$ has the overall
second-highest error, due to the difficulties in cases where the
phase of the coherence is close to zero.

[Figure 8: two histograms in the complex plane, horizontal axis Re{$\hat\Gamma_x$} from −1 to 1.]

Figure 8. Histogram of complex coherence values $\hat\Gamma_x$ measured from a
reverberant speech signal, for time-frequency bins with $ELR_{50\,\mathrm{ms}}$ > 10 dB
(left) and < −10 dB (right) (Room B, l = 2 m, d = 8 cm, θ = 60°, f =
1 kHz). Theoretical signal coherence $\Gamma_s$ computed from measured TDOA and
diffuse noise coherence $\Gamma_n$ are marked by o and x, respectively.

Table II

Estimation error of different CDR estimators.

CDR est.       Jeub      Thiergart 1*  proposed 1*  proposed 2*  Thiergart 2  proposed 3*  proposed 4*
Prior inform.  DOA, Γn   DOA, Γn       DOA, Γn      DOA, Γn      Γn           Γn           DOA
Room A         0.182     [illegible]   0.166        0.095        0.062        0.057        0.243
Room B         0.146     0.301         0.140        0.086        0.090        0.087        0.212
Room C         0.080     0.235         0.080        0.066        0.103        0.104        0.159
MR1            0.131     0.373         0.114        0.069        0.059        0.052        0.171
MR2            0.111     0.287         0.092        0.061        0.073        0.066        0.159
LR1            0.119     0.313         0.109        0.068        0.067        0.063        0.170
LR2            0.073     0.262         0.059        0.047        0.071        0.069        0.134
Mean           0.120     0.322         0.109        0.070        0.075        0.071        0.178

* unbiased
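The error measure used for this comparison can be sketched in a few lines. The example maps linear-scale CDR values to diffuseness and averages the squared difference over a list of time-frequency bins; the function names and the clipping of negative CDR estimates to zero (so that the diffuseness stays within [0, 1]) are assumptions of this sketch, not details taken from the paper.

```python
from typing import Sequence


def db_to_linear(x_db: float) -> float:
    """Convert a power ratio from dB to linear scale."""
    return 10.0 ** (x_db / 10.0)


def diffuseness(cdr: float) -> float:
    """Diffuseness D = 1 / (CDR + 1) for a linear-scale CDR."""
    return 1.0 / (max(cdr, 0.0) + 1.0)  # assumption: negative estimates are clipped


def diffuseness_mse(cdr_true: Sequence[float], cdr_est: Sequence[float]) -> float:
    """Mean squared diffuseness error, averaged over the given time-frequency bins."""
    if len(cdr_true) != len(cdr_est) or not cdr_true:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [(diffuseness(e) - diffuseness(t)) ** 2
              for t, e in zip(cdr_true, cdr_est)]
    return sum(errors) / len(errors)
```

Because the diffuseness is bounded, a single bin with a strongly overestimated CDR contributes at most 1 to the sum, which makes the measure less sensitive to outliers than an error computed directly on the CDR.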


D. Dereverberation Performance 

In the following, the signal enhancement system described
in Section V is evaluated for the application to dereverberation.
For all of the following results, two-channel signals
are processed by first applying spatial magnitude averaging
as described by (28), and then applying a postfilter based
on the different CDR estimators, or one of several other
dereverberation methods used for comparison.

1) Measures and Evaluation Method: To quantify the
amount of reverberation in the unprocessed and processed
signals, the time- and frequency-averaged early-to-late power
ratio $ELR_{50\,\mathrm{ms}}$ is evaluated (see Appendix). The amount
of signal distortion caused by the postfilter is quantified by
the frequency-weighted segmental signal-to-distortion ratio
(fwSegSDR), which we define as the fwSegSNR [41] computed
for the postfiltered early signal component (i.e., the
signal convolved with the first 50 ms of the RIR), with the
unprocessed early signal component $Y_e$ as the reference:

$$\mathrm{fwSegSDR} = \mathrm{fwSegSNR}\big(Y_e(l,f),\; G(l,f)\,Y_e(l,f)\big). \qquad (30)$$


The overall quality of the processed signals, including both
the effects of reverberation reduction and undesired speech
distortion, is evaluated using the recognition rate of an automatic
speech recognizer. The ASR engine PocketSphinx [42] is used
with an acoustic model trained on clean speech from the GRID
corpus [43], using MFCC+Δ+ΔΔ features. Cepstral mean
normalization is used for the equalization of the effect of early
reverberation [44]. For the computation of the recognition rate,
only the letter and the number in the utterance are evaluated, as
in the CHiME challenge [45]. Furthermore, two signal-based
measures for the overall speech quality are evaluated, which
were shown to be significantly correlated to the perceived
amount of reverberation [46]: PESQ [47] and the
frequency-weighted segmental signal-to-noise ratio (fwSegSNR) [41].
We use the wideband version of PESQ and give values in
the MOS-LQO scale. For both PESQ and the fwSegSNR, the
clean speech signal is used as reference.

CDR-based dereverberation is evaluated with all estimators
discussed in this paper. In addition to the CDR-based methods,
two heuristic coherence-based postfiltering methods are
evaluated: a version of Allen’s method [2], where the magnitude
of the coherence is used as a spectral gain and applied to
the spatially preprocessed signal, and the coherence-to-gain
mapping proposed by Westermann et al. [7], which depends
on a histogram of the magnitude squared coherence. Also
evaluated is the exponential decay model by Lebart et al.
[48], using the true reverberation times measured from the
RIRs, which in practice would have to be estimated blindly
from the reverberant signals [49]. For the method of Lebart
and the CDR-based methods, spectral magnitude subtraction
according to (29) is applied, with $G_{\min}$ = 0.1. The suppression
parameter $\mu$ is set to 1.3, which yields close to optimum
recognition rates for all except Lebart’s method (see the
comment in the following section). Ideal TDOA knowledge
is assumed for the CDR estimators which require a TDOA
estimate $\widehat{\Delta t}$, i.e., $\widehat{\Delta t} = \Delta t$. The dereverberation methods
are evaluated for the rooms A, B, C, MR1/2 and LR1/2. In
SR1/2, the very low amount of reverberation ($T_{60}$ < 0.3 s) did
not lead to a significantly lower recognition rate compared
to clean speech; therefore, these rooms are not included in
the evaluation. For each room and source position, 500 GRID
utterances are convolved with the measured two-channel RIRs
(in the case of the REVERB challenge RIRs, two neighboring
microphones are selected from the circular array), and then
processed by the dereverberation methods.

2) Results: Table III summarizes the resulting performance 
measurements, averaged over all source positions in each 
room. The first column shows the results for the unprocessed 
microphone signals. The spatial magnitude averaging leads 
to a small but consistent improvement in all performance 
measures, as seen in the second column. 

Postfiltering using the CDR estimator $\widehat{CDR}_{\mathrm{prop2}}$ leads to
the highest recognition rate among all methods across all
evaluated rooms, as well as to the highest average PESQ
score. Comparing the CDR-based methods, the following
observations can be made: both for the DOA-dependent and
DOA-independent estimators, all measures reflect the slight
advantage of the respective unbiased variant ($\widehat{CDR}_{\mathrm{prop1}}$ and
$\widehat{CDR}_{\mathrm{prop3}}$, respectively) over the biased estimators. For the
DOA-dependent estimator, the variant $\widehat{CDR}_{\mathrm{prop2}}$ further
improves the result over the first proposed unbiased estimator,
due to the different behavior of this estimator for coherence
values which deviate from the ideal coherence model. The
significant improvement suggests that further improvement
may be possible by modeling these deviations statistically and
explicitly optimizing the estimator for this model. The results
of the DOA-independent estimators are remarkable: without
requiring any knowledge or estimation of source DOA or
other parameters of the scenario, the CDR-based postfilter can
significantly increase the overall signal quality according to all
evaluated measures.

Table III

Performance measures, averaged over all source positions in each room. First column: unprocessed microphone signal, second
column: spatially averaged magnitudes without postfiltering, remaining columns: different postfilters.

Columns (preprocessor: none for column 1, squared magnitude averaging for columns 2-12):
(1) no postfilter (unprocessed)
(2) no postfilter
(3) Lebart, μ = 1.3
(4) coherence-based (Allen), no parameter
(5) coherence histogram (Westermann), k_p = 0.3[?]
(6) CDR-based, Jeub; prior information DOA, Γn; μ = 1.3
(7) CDR-based, Thiergart 1*; DOA, Γn; μ = 1.3
(8) CDR-based, proposed 1*; DOA, Γn; μ = 1.3
(9) CDR-based, proposed 2*; DOA, Γn; μ = 1.3
(10) CDR-based, Thiergart 2; Γn; μ = 1.3
(11) CDR-based, proposed 3*; Γn; μ = 1.3
(12) CDR-based, proposed 4*; DOA; μ = 1.3
* unbiased

Recognition rate [%]
           (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)   (10)   (11)   (12)
Room A    87.0   87.1   87.7   89.0   89.9   89.0   86.2   89.4   90.0   89.8   89.9   88.2
Room B    49.2   49.9   69.5   63.5   67.5   76.0   64.7   76.4   78.2   72.4   73.0   67.7
Room C    36.4   36.6   47.8   48.1   51.7   65.7   53.2   67.6   68.6   55.8   56.3   59.5
MR1       77.2   78.2   84.8   83.6   85.0   85.6   78.9   86.6   87.0   86.1   86.3   84.1
MR2       63.9   65.7   80.0   74.5   76.6   80.1   70.8   80.7   81.9   79.8   80.2   75.9
LR1       64.8   65.1   77.3   72.8   75.4   78.9   70.2   79.4   81.1   77.9   78.8   75.7
LR2       57.2   58.8   75.5   70.4   73.8   82.7   71.6   83.3   83.5   78.6   78.9   79.4
Mean      62.2   63.1   74.7   71.7   74.3   79.7   70.8   80.5   81.5   77.2   77.6   75.8

PESQ (MOS-LQO)
           (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)   (10)   (11)   (12)
Room A    1.51   1.53   1.72   1.58   1.64   1.67   1.46   1.67   1.76   1.64   1.66   1.65
Room B    1.19   1.19   1.34   1.23   1.25   1.36   1.26   1.34   1.38   1.27   1.28   1.29
Room C    1.13   1.13   1.23   1.14   1.16   1.31   1.21   1.32   1.32   1.17   1.17   1.26
MR1       1.28   1.29   1.46   1.33   1.41   1.37   1.26   1.37   1.45   1.41   1.43   1.38
MR2       1.30   1.33   1.56   1.40   1.48   1.43   1.28   1.45   1.57   1.56   1.56   1.50
LR1       1.18   1.19   1.33   1.21   1.25   1.24   1.18   1.22   1.27   1.24   1.25   1.25
LR2       1.28   1.31   1.57   1.37   1.50   1.54   1.27   1.57   1.61   1.58   1.58   1.54
Mean      1.27   1.28   1.46   1.32   1.38   1.42   1.27   1.42   1.48   1.41   1.42   1.41

fwSegSNR [dB] (section label inferred; rows after Room C are missing)
           (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)   (10)   (11)   (12)
Room A    6.15   6.58   8.34   7.94   8.96   7.17   7.14   8.63   8.48   8.71   8.73   6.94
Room B    2.07   2.15   6.13   4.20   3.92   4.46   4.15   5.81   5.45   5.38   5.40   4.04
Room C    1.08   1.31   4.58   2.89   [remaining values missing]

One further row survives without its section or row label:
-, -, -, 9.37, 24.73, 16.58, 10.21, 10.43, 11.37, 11.90, 13.23, 13.38

Compared to CDR-based dereverberation, the methods by
Allen and Westermann yield a low ELR improvement, and at
the same time a higher signal-to-distortion ratio. The overall
improvement in recognition rate and PESQ is relatively low
for both, while Westermann’s method shows good results for
the fwSegSNR. The discrepancies between these measures can
be explained by the different tradeoffs between reverberation
suppression and signal distortion, which have different effects
on the evaluated quality measures. Apparently, Allen’s and
Westermann’s methods apply a lower overall amount of
suppression, which benefits the fwSegSNR measure, but has only a
small effect on ASR recognition rate and PESQ.

It is noticeable that Lebart’s method yields the highest ELR,
but at the same time the worst signal-to-distortion ratio; this
indicates that reverberation is overestimated, and consequently
too much suppression is applied, possibly due to mismatch
between the exponential decay assumption and the early part
of the impulse responses [50]. Reducing the suppression parameter
to the optimum value $\mu$ = 0.6 to counter overestimation
increases the mean recognition rate to 77.4 %.

The estimator $\widehat{CDR}_{\mathrm{prop4}}$, which makes no assumption on
the noise coherence, yields on average comparable results to
the other estimators, although it cannot obtain usable CDR
estimates for some of the source positions where the TDOA
is close to zero. To gain further insight into the behavior
for different TDOAs, we evaluate the performance for the
different source positions individually in the following. Fig. 9
shows the recognition rate for signals processed with the
proposed unbiased estimators 2, 3 and 4 for the different
source positions in rooms A, B and C. While dereverberation
using the heuristic DOA-dependent estimator $\widehat{CDR}_{\mathrm{prop2}}$ yields
the highest recognition rate in almost all cases, the
DOA-independent estimator $\widehat{CDR}_{\mathrm{prop3}}$ also achieves a significant
improvement over all angles. The estimator $\widehat{CDR}_{\mathrm{prop4}}$, while not
usable for DOA θ = 0 due to the vanishing imaginary part
of the coherence, remarkably already achieves a significantly
increased recognition rate for DOAs as small as 10°, and
recognition rates similar to those of the DOA-independent estimator
for larger DOAs. In Room A, where the mismatch between
the diffuse assumption and the actual reverberation coherence
is significant, the estimator slightly exceeds the performance of
the (on average best) estimator $\widehat{CDR}_{\mathrm{prop2}}$ for some positions,
indicating that in some scenarios it may be advantageous to use
an estimator which does not assume an isotropic noise field.

Fig. 10 shows the time-averaged $ELR_{50\,\mathrm{ms}}$ for different
frequencies before and after processing for an exemplary scenario
(room B, l = 2 m, d = 8 cm), where $\widehat{CDR}_{\mathrm{prop2}}$ was used
for dereverberation. It can be seen that the dereverberation is
most effective at frequencies above 1000 Hz, but is already
significant at frequencies as low as 300 Hz.

VII. Conclusion 

Several well-known and some novel CDR estimation methods
and their application to dereverberation have been
investigated. Using simulated and measured RIRs for different
environments, it has been confirmed that the commonly used
model of a reverberant speech signal as a plane wave in diffuse
noise is sufficiently accurate to justify the application of
CDR-based signal enhancement to dereverberation. However, the
known CDR estimators were found to be either biased or not
robust enough for practical application to signal enhancement.
It has been shown that several variants of unbiased estimators
can be derived which improve robustness towards model errors,
and that knowledge of either the signal DOA or the noise
coherence is sufficient for estimation of the CDR. Employing
the improved estimators for dereverberation has been shown
to lead to improved dereverberation performance. Using the
DOA-independent estimator, the proposed signal enhancement
scheme constitutes a completely blind dereverberation system
which requires no knowledge or estimation of the signal DOA.

Appendix: Definition of the ELR 

[Figure 9: recognition rate plotted over DOA θ for Rooms A, B and C; panels (a) CDR estimator $\widehat{CDR}_{\mathrm{prop2}}$, (b) CDR estimator $\widehat{CDR}_{\mathrm{prop3}}$ (DOA-independent), (c) CDR estimator $\widehat{CDR}_{\mathrm{prop4}}$ (no noise coherence model); curves: 1 m, 2 m and 4 m, each processed and unprocessed.]

Figure 9. Average recognition rate for different rooms and source positions
(l = 1, 2, 4 m, θ = −90° ... 90°), for unprocessed signals and signals processed
by spatial magnitude averaging combined with coherence-based postfilters
based on different CDR estimators.

Figure 10. Time-averaged $ELR_{50\,\mathrm{ms}}$ as function of frequency (room B, l =
2 m, d = 8 cm), for unprocessed reverberant signal, and signal dereverberated
using the proposed unbiased estimator 2.

Reverberant microphone signals $x_i(t)$ can be written as
a convolution of RIRs $h_i(t)$ with a clean signal $d(t)$, i.e.,
$x_i(t) = h_i(t) * d(t)$. The RIRs can be split at $t = T_e$ into
an early part containing direct path and early reflections, and
a late part containing reverberation. To quantify the amount
of reverberation in a signal, the early-to-late power ratio
$ELR_{T_e}$ can then be defined as the power ratio between the
components created by convolution with the early RIR, and
the reverberation components created by convolution with the
late RIR, where $T_e$ is set to an appropriate threshold, e.g.,
$T_e$ = 50 ms [20]. When $T_e$ is set to include only the direct
path in the early component, the ELR is equivalent to the DRR.
For the evaluation in this paper, the $ELR_{T_e}$ is computed for
the unprocessed microphone signals, and for the signals at the
output of the signal enhancement system by processing the
early and late signal components separately.
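The ELR definition above can be illustrated with a broadband, time-domain sketch. It is a simplification: the paper evaluates the ELR per time-frequency bin and then averages, whereas this example computes one ratio for the whole signal. The function names, the choice of the largest RIR tap as the direct path, and measuring $T_e$ from that tap are assumptions of this sketch.

```python
import math
from typing import List, Sequence, Tuple


def split_rir(h: Sequence[float], fs: float,
              t_e: float = 0.050) -> Tuple[List[float], List[float]]:
    """Split an RIR into early and late parts, t_e seconds after the direct path.

    Assumption: the direct path is the tap with the largest magnitude.
    """
    peak = max(range(len(h)), key=lambda n: abs(h[n]))
    cut = peak + int(round(t_e * fs))
    early = [v if n < cut else 0.0 for n, v in enumerate(h)]
    late = [v if n >= cut else 0.0 for n, v in enumerate(h)]
    return early, late


def convolve(h: Sequence[float], d: Sequence[float]) -> List[float]:
    """Direct-form linear convolution (fine for short illustrative signals)."""
    y = [0.0] * (len(h) + len(d) - 1)
    for i, hv in enumerate(h):
        if hv == 0.0:
            continue
        for j, dv in enumerate(d):
            y[i + j] += hv * dv
    return y


def elr_db(h: Sequence[float], d: Sequence[float], fs: float,
           t_e: float = 0.050) -> float:
    """Broadband early-to-late power ratio in dB for clean signal d and RIR h."""
    early, late = split_rir(h, fs, t_e)
    p_early = sum(v * v for v in convolve(early, d))
    p_late = sum(v * v for v in convolve(late, d))
    return math.inf if p_late == 0.0 else 10.0 * math.log10(p_early / p_late)
```

To evaluate a processed signal in the same spirit, the early and late components would each be passed through the enhancement system with identical gains before their powers are compared, as described in the text.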
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